Please read all of these instructions before you begin working.
Make a copy of this R Markdown file for this lab. You will edit your copy of this file, and turn it in for this week’s lab.
You must turn in your completed lab by 11:59pm on Saturday, April 20 on D2L.
To receive full credit for this lab, you must submit three files to the appropriate location on D2L: your R Markdown file, the HTML file produced when you “Knit” your R Markdown file, and your R history file.
If you do not submit all three files, what you turned in will be graded and then 50% of the total points you earned for the lab will be subtracted from your score.
Where it says author: “YOUR NAME HERE” in the header of this file (above), replace YOUR NAME HERE with your own name.
If a question asks for an answer that does not involve code, type your answer in below the question.
If a question asks for R code, type your code in an R code chunk below the question.
If any of the code you turn in for the lab produces an error or otherwise does not run correctly, you can still receive partial credit if you include a comment indicating that you are aware of the error, and describing what you tried to do to correct the error and answer the question.
Write a definition for each term below. Use complete sentences.
A corpus is an object that contains many strings and that is annotated with metadata. The metadata associated with it can include information like which document the string is from.
Stop words are a list of words that are to be excluded from the analysis. Words like “the”, “and”, and “I” are very common in all forms of literature, and by removing them we can get a better look at which distinctive words are actually common within texts.
A token is a piece of a larger collection of information. This can be an individual word out of a string. For example, if we take a speech and split it up into individual words, the speech has been tokenized and each word is a token.
A word cloud is a visualization that is a grouping of words with their size representing their frequency. If the word “Data” is the most common word in a dataset then it will be the largest word in the visualization.
When a new president is elected in the United States, he or she gives a speech at the inauguration ceremony. The data you will be using in this lab contains the text of inaugural addresses from all of the past US Presidents.
The data files you are using in this lab came from the website Kaggle.com. Navigate to the links below and read the documentation for the two datasets.
Write a few sentences below describing both datasets in your own words:
The Presidential Inaugural Speeches dataset contains data on each inaugural speech given by a president of the United States. Each entry in the dataset includes the president's name, which address it was (for example, the first inaugural address or just inaugural address), the date they gave the speech, and the full transcript of the speech.
The US Presidents dataset, however, has just one entry per president, along with attributes such as the start and end of their term, the president's name, their occupation before becoming president, their party, and their vice president.
The data files need to be loaded and manipulated a bit to prepare them for later analysis in this lab. Run the R code Dr. Rader has provided in the code chunk below to prepare the data for analysis. Make sure you have downloaded the two data files from D2L and saved them in the same folder as this file.
## Parsed with column specification:
## cols(
## Name = col_character(),
## `Inaugural Address` = col_character(),
## Date = col_character(),
## text = col_character()
## )
## Parsed with column specification:
## cols(
## start = col_character(),
## end = col_character(),
## president = col_character(),
## prior = col_character(),
## party = col_character(),
## vice = col_character()
## )
The two data files have been combined into one data table. Answer each question below about the data table speeches. Use whatever R commands you need to in order to get the answers. Not all of the questions require you to write code.
speeches
Each row in the new tibble is a speech (Inaugural Address) that a president gave in the past.
The columns in the data table are the variables named in the statement “select(president, title, party, line, text)”. These columns include:
text - transcript of the speech
What are the dimensions of the data table?
The data table after manipulation has 5715 speech entries and 5 columns (variables).
Describe what “tidy” text is in your own words:
In the empty R code chunk below, transform the data table speeches into a “tidy” text table, and save it as an object called tidyspeeches. Use the unnest_tokens function. The first argument of unnest_tokens() is what you want the new column containing the words to be called (you should call it word), and the second argument is which column has the text in it that will be broken up into tokens. Note: there is an example of how to do this in the Week 15 lecture R Markdown file, in lines 51-52.
tidyspeeches = speeches %>%
  unnest_tokens(word, text)
tidyspeeches
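To see what unnest_tokens() does on something small, here is a minimal sketch on a made-up two-row table. The table `toy_speeches` and its contents are hypothetical and stand in for speeches; only the unnest_tokens() call mirrors the command above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-row table standing in for `speeches`
toy_speeches <- tibble(
  president = c("President A", "President B"),
  text = c("We the People", "Ask not what your country can do")
)

# One row per word; by default unnest_tokens() lowercases the words
# and drops the original text column
toy_tidy <- toy_speeches %>%
  unnest_tokens(word, text)

toy_tidy
```

The other columns (here, president) are repeated on every row, which is why the tidy table has many more rows than the original but keeps the same identifying information.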
speeches data table. Describe the changes that you made to the data table, in your own words: We tokenized each of the speeches into individual words.
tidyspeeches data table, after you created it in Question 9? The new dataset tidyspeeches has 140036 entries and 5 columns (variables).
In the empty R code chunk below, use the anti_join function along with the stopwords list included in the tidytext library to remove the stop words from the tidyspeeches data table. Save the output of this command back to the tidyspeeches object. Note: there is an example of how to do this in the Week 15 lecture R Markdown file, in line 106.
tidyspeeches = tidyspeeches %>%
  mutate(line = row_number()) %>%
  anti_join(stop_words) # stop_words is an object in tidytext
## Joining, by = "word"
tidyspeeches
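The anti_join() step keeps only the rows of the left table that have no match in the right table. A small sketch of that behavior, using a made-up table (`toy_words` is hypothetical) and the real stop_words object from tidytext:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tidy table with one word per row
toy_words <- tibble(word = c("the", "people", "and", "government"))

# Keep only the words that do NOT appear in stop_words;
# naming the join column explicitly avoids the "Joining, by" message
toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_kept
```

Here “the” and “and” are dropped because they are in stop_words, while “people” and “government” remain.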
Next, look at the overall word frequencies in the inaugural speech data. Run the code in the code chunk below:
tidyspeeches %>% count(word) %>% arrange(desc(n))
Notice that it looks like the most common words are “government” and “people”. “Country” is up there too. There are also words like “america” and “american”. It seems like there are some words like these that we should probably filter out of the data table, because they are so common across the inauguration speeches.
In an R code chunk below, write a dplyr command to filter out the following words:
government, people, world, country, nation, america, american, americans, united, constitution, citizens
Don’t forget to save the result of the command back to the tidyspeeches object.
speeches_stopwords <- tibble(word = c("government", "people", "world", "country", "nation", "america", "american", "americans", "united", "constitution", "citizens"))
tidyspeeches = tidyspeeches %>%
  anti_join(speeches_stopwords) # speeches_stopwords is the custom word list defined above
## Joining, by = "word"
tidyspeeches
tidyspeeches %>% count(word) %>% arrange(desc(n))
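The answer above removes the words with anti_join(). Since the question asks for a filter, an equivalent approach with dplyr's filter() and %in% could look like the sketch below. The table `toy_words` and the vector `common_words` are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy table and a short list of words to drop
toy_words <- tibble(word = c("liberty", "people", "union", "america"))
common_words <- c("people", "america")

# Keep rows whose word is not in the list
toy_filtered <- toy_words %>%
  filter(!word %in% common_words)

toy_filtered
```

Both approaches should give the same rows; filter() avoids building a separate tibble when the word list is short.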
Starting in line 131 of the Week 15 Lecture R Markdown file, there are two examples of how to make word clouds out of “tidy” text data. Following those examples, create word clouds for the following presidential inauguration speeches:
tidyspeeches %>%
  filter(president == "George Washington") %>%
  filter(title == "First Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
tidyspeeches %>%
  filter(president == "Abraham Lincoln") %>%
  filter(title == "First Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): constitutional could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): expressly could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): laws could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): section could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): power could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): enforced could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): property could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): purpose could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): fugitive could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): service could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): friends could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): intercourse could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): congress could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): institutions could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): parties could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): offices could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): contract could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): oath could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): written could not be fit on
## page. It will not be plotted.
tidyspeeches %>%
  filter(president == "Franklin D. Roosevelt") %>%
  filter(title == "Third Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): democracy could not be fit
## on page. It will not be plotted.
tidyspeeches %>%
  filter(president == "John F. Kennedy") %>%
  filter(title == "Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): pledge could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): president could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): generation could not be fit
## on page. It will not be plotted.
tidyspeeches %>%
  filter(president == "Barack Obama") %>%
  filter(title == "Second Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): generation could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): god could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): time could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): freedom could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): meaning could not be fit on
## page. It will not be plotted.
tidyspeeches %>%
  filter(president == "Donald J. Trump") %>%
  filter(title == "Inaugural Address") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): millions could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): countries could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): dreams could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): borders could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): families could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): factories could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): foreign could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): obama could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): power could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): jobs could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): bring could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): heart could not be fit on
## page. It will not be plotted.
Recall that sentiment analysis requires a dictionary of words and their associated “sentiment”, like positive, negative, or neutral. Fortunately, the tidytext library makes sentiment analysis very easy, because it has a couple of built-in sentiment dictionaries.
In the empty R code chunk below, use the “bing” sentiment dictionary that is part of the tidytext library to add a sentiment column to the tidyspeeches data table. Save this as a new object called sentiments. Note: there is an example of how to do this starting in line 157 of the Week 15 lecture R Markdown file.
sentiments = tidyspeeches %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
sentiments
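An inner_join() with the sentiment dictionary keeps only the words that appear in the dictionary and attaches their sentiment label. A minimal sketch on a made-up table (`toy_words` is hypothetical; get_sentiments("bing") is the same dictionary used above):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tidy table; "table" is not a sentiment word
toy_words <- tibble(word = c("good", "bad", "table"))

# Words missing from the dictionary are dropped by the inner join
toy_sentiments <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiments
```

This is why the sentiments table has fewer rows than tidyspeeches: words with no entry in the dictionary do not survive the join.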
Graph the percent of positive and negative words in each of the six inaugural speeches you made word clouds out of, above:
The beginning of the R command is included for you below. You need to finish it. Make sure that there is one facet per president. Note: there is an example of how to do this starting in line 183 of the Week 15 lecture R Markdown file.
sentiments %>%
  filter((president == "George Washington" & title == "First Inaugural Address") |
         (president == "Abraham Lincoln" & title == "First Inaugural Address") |
         (president == "Franklin D. Roosevelt" & title == "Third Inaugural Address") |
         (president == "John F. Kennedy" & title == "Inaugural Address") |
         (president == "Barack Obama" & title == "Second Inaugural Address") |
         (president == "Donald J. Trump" & title == "Inaugural Address")) %>%
  group_by(president, sentiment) %>%
  summarize(n = n()) %>%
  mutate(percent = n / sum(n) * 100) %>%
  ggplot(aes(x = sentiment, y = percent, fill = sentiment)) +
  geom_bar(stat = "identity") +
  facet_wrap(~president) +
  labs(x = "Sentiment",
       y = "Percentage of Positive and Negative Words",
       title = "Proportions of Positive and Negative Words in Presidential Inaugural Speeches")
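The percent calculation above relies on how grouping works: after summarize(), the last grouping variable (sentiment) is dropped, so sum(n) in the mutate() step is computed within each president. A small sketch with made-up data (`toy_sentiments` is hypothetical; the `.groups` argument assumes dplyr 1.0.0 or later and only makes the default behavior explicit):

```r
library(dplyr)
library(tibble)

# Hypothetical sentiment table: president A has 3 words, B has 2
toy_sentiments <- tibble(
  president = c("A", "A", "A", "B", "B"),
  sentiment = c("positive", "positive", "negative", "positive", "negative")
)

toy_percent <- toy_sentiments %>%
  group_by(president, sentiment) %>%
  # drop_last leaves the result grouped by president only
  summarize(n = n(), .groups = "drop_last") %>%
  # sum(n) is therefore the word total for each president
  mutate(percent = n / sum(n) * 100)

toy_percent
```

The percentages within each president add up to 100, which is what makes the facets comparable even when the speeches differ in length.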
Starting on line 306 in the Week 15 lecture R Markdown file, there is an example of how to calculate TF-IDF for a corpus, and then graph the high TF-IDF words in each of six Shakespeare plays.
Following this example, calculate TF-IDF for six inaugural speeches from the tidyspeeches data table. You can use the same six as above or explore other speeches; it is your choice. Because titles are not unique identifiers of speeches, the data table needs a unique identifier for each speech. This command is started for you below, including the code to create a unique identifier for each speech, speech.
speeches_tfidf <- tidyspeeches %>%
  mutate(speech = paste(president, title))
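The command above stops after creating the identifier. The lecture file is not reproduced here, so the following is only a sketch of one common way to continue, assuming the usual tidytext workflow: count each word within each speech, then call bind_tf_idf(). The table `toy_tokens` is made up for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tidy table: one row per word, with a speech identifier
toy_tokens <- tibble(
  speech = c("Speech A", "Speech A", "Speech B", "Speech B"),
  word   = c("union", "liberty", "union", "jobs")
)

toy_tfidf <- toy_tokens %>%
  count(speech, word) %>%          # word counts per speech
  bind_tf_idf(word, speech, n)     # adds tf, idf, and tf_idf columns

toy_tfidf
```

In this toy corpus, “union” appears in both speeches, so its idf and tf_idf are 0, while “liberty” and “jobs” each appear in only one speech and get a positive score. Words that are shared across every document are the ones TF-IDF pushes down.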
Then, following the example starting in line 333 in the Week 15 lecture R Markdown file, make a faceted bar graph like the one in that file showing the high TF-IDF words in each of the six speeches you have selected to graph. This is started for you below (assuming you want to use the same six speeches as above).
tfidf_ranks <- speeches_tfidf %>%
  filter(speech %in% c("George Washington First Inaugural Address",
                       "Abraham Lincoln First Inaugural Address",
                       "Franklin D. Roosevelt Third Inaugural Address",
                       "John F. Kennedy Inaugural Address",
                       "Barack Obama Second Inaugural Address",
                       "Donald J. Trump Inaugural Address")) %>%
  group_by(speech)
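The command above ends at group_by(speech). One possible way to finish, sketched here on made-up numbers, is to keep the highest TF-IDF words in each speech and draw one facet per speech. The table `toy_scores` and its tf_idf values are hypothetical, and slice_max() assumes dplyr 1.0.0 or later; the lecture example may use a different function to pick the top words.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical tf_idf scores; in the lab these would come from the TF-IDF step
toy_scores <- tibble(
  speech = rep(c("Speech A", "Speech B"), each = 3),
  word   = c("union", "liberty", "oath", "jobs", "borders", "dreams"),
  tf_idf = c(0.30, 0.20, 0.10, 0.40, 0.25, 0.05)
)

# Keep the two highest-scoring words within each speech
toy_top <- toy_scores %>%
  group_by(speech) %>%
  slice_max(tf_idf, n = 2) %>%
  ungroup()

# One facet per speech, bars ordered by score
toy_plot <- ggplot(toy_top, aes(x = reorder(word, tf_idf), y = tf_idf)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~speech, scales = "free") +
  labs(x = "Word", y = "TF-IDF")

toy_plot
```

Using scales = "free" lets each facet show its own set of words, since the top words differ from speech to speech.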